An External Memory Approach to Compute the Statistics of Maximal Repeats from Whole Genome Sequences

نویسنده

  • Jing-Doo Wang
چکیده

The objective of this paper is to develop an external memory approach to extract the maximal repeats from whole genome sequences with the statistics of these repeats across classes, where the definition of a class is determined on what kind of statistics one wants to compute. We proposed a heuristic method consisted of a bucketsort-like approach and the Chinese term extraction approach. The former was used to sort the suffixes of DNA sequences stored in files and the later was used to extract maximal repeats by scanning the sorted suffixes while computing the statistics of these repeats. The statistics of these repeats across classes might be useful for sequence classification and species identification.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

RepMaestro: scalable repeat detection on disk-based genome sequences

MOTIVATION We investigate the problem of exact repeat detection on large genomic sequences. Most existing approaches based on suffix trees and suffix arrays (SAs) are limited either to small sequences or those that are memory resident. We introduce RepMaestro, a software that adapts existing in-memory-enhanced SA algorithms to enable them to scale efficiently to large sequences that are disk re...

متن کامل

Exhaustive Computation of Exact duplications via Super and Non-Nested Local Maximal repeats

We propose and implement a method to obtain all duplicated sequences (repeats) from a chromosome or whole genome. Unlike existing approaches our method makes it possible to simultaneously identify and classify repeats into super, local, and non-nested local maximal repeats. Computation verification demonstrates that maximal repeats for a genome of several gigabases can be identified in a reason...

متن کامل

Searching the genome of beluga(Husohuso) for sex markers based on targeted Bulked SegregantAnalysis (BSA)

In sturgeon aquaculture, where the main purpose is caviar production, a reliable method is needed to separate fish according to gender. Currently, due to the lack of external sexual dimorphism, the fish are sexed by an invasive surgical examination of the gonads. Development of a non-invasive procedure for sexing fish based on genetic markers is of special interest. In the present study we empl...

متن کامل

Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome

MOTIVATION There is a significant ongoing research to identify the number and types of repetitive DNA sequences. As more genomes are sequenced, efficiency and scalability in computational tools become mandatory. Existing tools fail to find distant repeats because they cannot accommodate whole chromosomes, but segments. Also, a quantitative framework for repetitive elements inside a genome or ac...

متن کامل

Profile of Eight Prophage Sequences Present in the Genomes of Different Acinetobacter baumannii Strains

ABSTRACT           Background and Objective: Prophage sequences are major contributors to interstrain variations within the same bacterial species. Acinetobacter baumannii is a gram-negative bacterium that causes a wide range of nosocomial infections, especially in intensive care unit inpatients. Prophage sequences constitute a considerable proporti...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005